Introduction

In this project I will build a supervised multi-class classification model to predict the product category (prodcat1) a customer is likely to order.

Table of contents:

  1. Data Cleaning
    $\;\;\;\;\;\;$ 1.1 Handling null values
    $\;\;\;\;\;\;$ 1.2 Handling data types
    $\;\;\;\;\;\;$ 1.3 Data filtering
  2. EDA
    $\;\;\;\;\;\;$ 2.1 prodcat1 observation distribution in order file
    $\;\;\;\;\;\;$ 2.2 prodcat1 observation distribution in online file
    $\;\;\;\;\;\;$ 2.3 event2 observation distribution in online file
    $\;\;\;\;\;\;$ 2.4 total revenue distribution by prodcat1 in order file
    $\;\;\;\;\;\;$ 2.5 customers in both files
    $\;\;\;\;\;\;$ 2.6 top customers that generate majority of revenue
    $\;\;\;\;\;\;$ 2.7 customer segmentation
  3. Feature Engineering
  4. Feature Selection
  5. Modelling
    $\;\;\;\;\;\;$ 5.1 Random Forest Classifier
    $\;\;\;\;\;\;$ 5.2 SMOTE Random Forest Classifier
    $\;\;\;\;\;\;$ 5.3 Gradient Boosting Classifier
  6. Model Comparison
  7. Summary

1. Data Cleaning

1.1 Handling null values

There are null rows in the prodcat2 column of the order file. Since there are not many of them, I will delete those rows.

There are a lot of missing values in the event1 column, so I will drop this feature from the online file.
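As a sketch (using pandas, with toy stand-ins for the two files; `order_df` and `online_df` are hypothetical names), the null handling might look like:

```python
import pandas as pd

# Toy stand-ins for the order and online files (hypothetical frames)
order_df = pd.DataFrame({
    "prodcat1": [1, 2, 2, 5],
    "prodcat2": [10.0, None, 12.0, 13.0],
})
online_df = pd.DataFrame({
    "event1": [None, None, 7.0],
    "event2": [1, 7, 7],
})

# Few nulls in prodcat2: drop only the affected rows from the order file
order_df = order_df.dropna(subset=["prodcat2"])

# event1 is mostly missing: drop the whole column from the online file
online_df = online_df.drop(columns=["event1"])

print(len(order_df), list(online_df.columns))  # 3 ['event2']
```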

1.2 Handling Data Types

1.3 Data Filtering

The date ranges of the two files are not aligned. I will filter the order data to match the date range covered by the online data.
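A minimal sketch of that filter, assuming a date column named `dt` in both hypothetical frames:

```python
import pandas as pd

# Toy data: the order file spans a wider date range than the online file
order_df = pd.DataFrame({
    "dt": pd.to_datetime(["2016-01-05", "2016-06-01", "2017-03-01"]),
    "revenue": [10.0, 20.0, 30.0],
})
online_df = pd.DataFrame({
    "dt": pd.to_datetime(["2016-05-01", "2016-09-30"]),
})

# Keep only order rows inside the window covered by the online data
start, end = online_df["dt"].min(), online_df["dt"].max()
order_df = order_df[order_df["dt"].between(start, end)]

print(len(order_df))  # 1
```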

2. EDA

2.1 prodcat1 observation distribution in order file

Takeaway: prodcat1 is our target variable, with 7 distinct categories. The distribution above shows the data is imbalanced: there are around 60,000 observations for category 2, while categories 5 and 7 have very low counts. The data needs to be resampled for modelling.
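As an illustrative sketch (toy values, not the real counts), the imbalance is visible from a simple `value_counts`:

```python
import pandas as pd

# Toy stand-in for the prodcat1 column of the order file
prodcat1 = pd.Series([2, 2, 2, 1, 3, 2, 5, 2, 2])

# Class counts make the imbalance explicit
counts = prodcat1.value_counts()
print(counts.idxmax(), counts.max())  # 2 6
```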

2.2 prodcat1 observation distribution in online file

Takeaway: the online data covers only 3 of the 7 categories, so it is incomplete.

2.3 event2 observation distribution in online file

Takeaway: there are 10 event2 categories, and the majority of observations fall in category 7.

2.4 Total revenue distribution by prodcat1 in order file

Takeaway: the revenue distribution by prodcat1 is similar to its observation distribution.

2.5 Customers in both files

2.6 Top customers that generate majority of revenue

Takeaway: out of 47,169 customers in the order file, 18,000 generate 80% of the total revenue, i.e. 38% of customers generate 80% of the revenue.
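One way to reach that figure is a cumulative-share calculation; a toy sketch (hypothetical per-customer revenue totals):

```python
import pandas as pd

# Toy revenue totals per customer (hypothetical values)
rev = pd.Series({"a": 500.0, "b": 300.0, "c": 150.0, "d": 50.0})

# Sort descending and compute each customer's cumulative share of revenue
share = rev.sort_values(ascending=False).cumsum() / rev.sum()

# Smallest number of top customers whose cumulative share reaches 80%
n_top = int((share < 0.80).sum()) + 1
print(n_top)  # 2
```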

2.7 Customer Segmentation

For customer segmentation we will create an RFM model. RFM studies customers’ behaviour and clusters them using three metrics:


1. Recency (R): the number of days since the last purchase, measured from a hypothetical snapshot day.
2. Frequency (F): the number of transactions made during the period of study.
3. Monetary Value (M): how much money each customer has spent during the period of study.
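The three metrics can be computed from a transaction log with one groupby; a sketch with toy data (column names `custno`, `dt`, `revenue` are illustrative):

```python
import pandas as pd

# Toy transaction log; column names are illustrative
tx = pd.DataFrame({
    "custno": ["a", "a", "b"],
    "dt": pd.to_datetime(["2016-09-01", "2016-09-20", "2016-06-15"]),
    "revenue": [40.0, 60.0, 25.0],
})

snapshot = pd.Timestamp("2016-10-01")  # hypothetical snapshot day

rfm = tx.groupby("custno").agg(
    recency=("dt", lambda d: (snapshot - d.max()).days),  # days since last buy
    frequency=("dt", "count"),                            # number of orders
    monetary=("revenue", "sum"),                          # total spend
)
print(rfm)
```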

Takeaway: the distributions of frequency and monetary value are extremely right-skewed, especially frequency: its min, 25th, 50th and 75th percentiles are all close to 1. This tells us most customers purchase infrequently.

Next, we apply a log transformation to the RFM dataframe so that the metrics are brought onto comparable scales.

Without standardization, the clustering algorithm could be biased since monetary value might be much higher than the other two variables (recency and frequency are measured in days while monetary value is in dollars term.)
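A sketch of the transform-then-standardize step (toy RFM matrix; the values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy RFM matrix: rows are customers, columns are (recency, frequency, monetary)
rfm = np.array([[35.0, 23.0, 1731.0],
                [415.0, 1.0, 40.0],
                [200.0, 3.0, 120.0]])

# log1p dampens the heavy right skew, then scaling standardizes each metric
rfm_log = np.log1p(rfm)
rfm_scaled = StandardScaler().fit_transform(rfm_log)

print(rfm_scaled.mean(axis=0).round(6))  # each column now has mean ~0
```

`log1p` is used rather than `log` so that zero values (e.g. zero spend) do not blow up.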

We will now fit the data to a clustering algorithm so that the machine can learn patterns from the three metrics and group customers accordingly. We will use the KMeans algorithm.

Takeaway: we will use 3 clusters.
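Fitting KMeans with that choice is then a one-liner; a sketch using random stand-in data in place of the scaled RFM matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the scaled RFM matrix (random toy data)
rng = np.random.default_rng(0)
rfm_scaled = rng.normal(size=(60, 3))

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(rfm_scaled)

print(len(set(labels)))  # 3 distinct clusters
```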

Takeaway: Cluster 1 contains our most valued customers: they have shopped 23 times, far more than the other two groups, and last shopped 35 days ago, while Cluster 2 customers last shopped 415 days ago. Most importantly, Cluster 1 customers have spent US$1,731.

3. Feature Engineering

We will be creating time-based features. The focus is on avoiding data leakage, i.e. not using any data that would be unknown at prediction time. The features are explained below:
Note: we choose 35 days because the recency of Cluster 1 is 35 days.
1. browse_activity_last7days : browsing activity of the customer in last 7 days
2. max_browse_cat_last7days : max browsed prodcat1 category in last 7 days
3. max_browse_event_last7days : max browsed event2 category in last 7 days
4. count_browse_event_last7days : count of unique event2 categories browsed in last 7 days
5. browse_activity_last35days : browsing activity of the customer in last 35 days
6. max_browse_cat_last35days : max browsed prodcat1 category in last 35 days
7. max_browse_event_last35days : max browsed event2 category in last 35 days
8. count_browse_event_last35days : count of unique event2 categories browsed in last 35 days
9. order_activity_last7days : total order activity of the customer in last 7 days
10. count_unique_orders_last7days : count of unique orders in last 7 days
11. max_prodcat2_type_last7days : max ordered prodcat2 in last 7 days
12. count_unique_prodcat2_last7days : count of unique prodcat2 categories in last 7 days
13. total_rev_last7days : total revenue by customer in last 7 days
14. mean_rev_last7days : mean revenue by customer in last 7 days
15. order_activity_last35days : total order activity of the customer in last 35 days
16. count_unique_orders_last35days : count of unique orders in last 35 days
17. max_prodcat2_type_last35days : max ordered prodcat2 in last 35 days
18. count_unique_prodcat2_last35days : count of unique prodcat2 categories in last 35 days
19. total_rev_last35days : total revenue by customer in last 35 days
20. mean_rev_last35days : mean revenue by customer in last 35 days
21. day_of_week : numeric day of the week
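The windowed features above follow one pattern: aggregate only rows strictly before a reference date and within the lookback window. A leakage-safe sketch (toy order log; `ref_date` and the column names are hypothetical):

```python
import pandas as pd

# Toy order log; a real pipeline would compute this per customer and per date
orders = pd.DataFrame({
    "custno": ["a", "a", "a"],
    "dt": pd.to_datetime(["2016-09-29", "2016-09-05", "2016-08-01"]),
    "revenue": [30.0, 50.0, 20.0],
})
ref_date = pd.Timestamp("2016-10-01")  # hypothetical prediction day

def window_features(df, days):
    """Aggregate activity in the `days` before ref_date (no future data used)."""
    win = df[(df["dt"] < ref_date) & (df["dt"] >= ref_date - pd.Timedelta(days=days))]
    return {
        f"order_activity_last{days}days": len(win),
        f"total_rev_last{days}days": win["revenue"].sum(),
    }

feats = {**window_features(orders, 7), **window_features(orders, 35)}
print(feats["order_activity_last7days"], feats["total_rev_last35days"])  # 1 80.0
```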

4. Feature Selection

Correlation chart

Takeaway:
1. 'max_prodcat2_type_last7days' and 'max_prodcat2_type_last35days' have the largest correlation with prodcat1.
2. 'mean_rev_last7days' and 'mean_rev_last35days' are the least correlated with prodcat1 - we will drop these columns.

3. 'total_rev_last35days' is highly correlated with 'order_activity_last35days' - we will drop 'total_rev_last35days' to reduce redundancy.
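A sketch of that correlation-based pruning (synthetic toy columns; the 0.9 threshold is a judgment call, not from the original analysis):

```python
import numpy as np
import pandas as pd

# Toy feature frame; names mirror the engineered features
rng = np.random.default_rng(1)
x = rng.normal(size=100)
df = pd.DataFrame({
    "order_activity_last35days": x,
    "total_rev_last35days": 3 * x + rng.normal(scale=0.1, size=100),
    "mean_rev_last7days": rng.normal(size=100),
})

corr = df.corr()
high = corr.loc["order_activity_last35days", "total_rev_last35days"]

# Drop one column of a highly correlated pair
if abs(high) > 0.9:
    df = df.drop(columns=["total_rev_last35days"])

print("total_rev_last35days" in df.columns)  # False
```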

5. Modelling

5.1 Random Forest Classifier

Takeaway: the data is imbalanced, with far more observations for prodcat1 class 2, which is why we see higher recall and F1-score for class 2. Conversely, the low observation counts for classes 5 and 7 lead to low recall and F1-scores for those classes.
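A minimal sketch of this baseline on synthetic imbalanced data (the class weights below mimic the skew toward class 2; everything here is a stand-in for the real feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for the engineered feature matrix
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

# Macro averaging weights every class equally, exposing minority-class weakness
macro_f1 = f1_score(y_te, rf.predict(X_te), average="macro")
print(round(macro_f1, 2))
```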

5.2 SMOTE (Synthetic Minority Over-Sampling Technique) with Random Forest

SMOTE is an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replacement.
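The core idea can be sketched in a few lines of NumPy (a toy illustration, not the library implementation; in practice the `SMOTE` class from the imbalanced-learn package does this with more care):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE idea: synthesize a point by interpolating between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]   # nearest neighbours, skipping self
        j = rng.choice(nbrs)
        gap = rng.random()                 # random point along the segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(out)

# Four minority points on the unit square; synthesize six new ones
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_sketch(X_min, n_new=6)
print(X_syn.shape)  # (6, 2)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the minority class's region rather than duplicating existing rows.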

Takeaway: applying the oversampling technique increased overall accuracy, and recall for the minority classes (5 and 7) also increased. The SMOTE Random Forest performs better.

5.3 Gradient Boosting Classifier

6. Model Comparison

| Model | Accuracy | Macro average precision | Macro average recall | Macro average F1 score |
|-------|----------|-------------------------|----------------------|------------------------|
| Random Forest Classifier | 85.04% | 84% | 81% | 82% |
| SMOTE Random Forest Classifier | 85.78% | 84% | 84% | 84% |
| Gradient Boosting Classifier | 30.88% | 10% | 20% | 13% |

Takeaway: we will choose the SMOTE Random Forest Classifier, as it performs best, with higher recall and F1-score than the Random Forest trained on the imbalanced data.

7. Summary

In this exercise I performed and learned the following:
1. Our datasets are incomplete and imbalanced - I handled this by filtering and resampling the data.
2. I segmented the customers using recency, frequency and monetary value. There are many low-spending customers - more than 20,000 customers buy just once a year - and we need a different business strategy for each cluster:
$\;\;\;\;\;\;$ Cluster 1: Improve Retention
$\;\;\;\;\;\;$ Cluster 0,2: Improve Retention + Increase Frequency
3. I created some time-based features and found that the prodcat2-based features correlate well with the label prodcat1. I used a correlation matrix to select the best features.
4. After running different classification models and comparing their confusion matrices, we determined that Random Forest with SMOTE resampling best suits the data.

Next steps :
1. Better data collection - A balanced dataset where we have equal distribution across the prodcat1 classes both in order and online file.
2. Create more features at different time lags to provide more information to the model.
3. Due to time constraints, I subset the data and modelled only Cluster 1 customers. A next version of the model should handle customers from all clusters.
4. I would like to try more complex models, such as neural networks, to enhance accuracy.
5. In this project I tried oversampling; in future I would also like to look into undersampling techniques.